text segmentation
Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Jia, Mumin, Diaz-Rodriguez, Jairo
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (12 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- (2 more...)
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
Li, Haoyuan, Shen, Zhengyuan, Jeoung, Sullam, Chen, Yueyan, Li, Jiayu, Zhu, Qi, Wang, Shuai, Ioannidis, Vassilis, Rangwala, Huzefa
As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL~performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
- North America > United States > North Carolina (0.04)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (2 more...)
Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation
Diaz-Rodriguez, Jairo, Jia, Mumin
Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Doña Ana County > Las Cruces (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (13 more...)
Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
Nguyen, Hai Toan, Nguyen, Tien Dat, Nguyen, Viet Ha
Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension
Hai, Toan Nguyen, Viet, Ha Nguyen, Xuan, Truong Quan, Minh, Duc Do
Vietnamese, the 20th most spoken language with over 102 million native speakers, lacks robust resources for key natural language processing tasks such as text segmentation and machine reading comprehension (MRC). To address this gap, we present VSMRC, the Vietnamese Text Segmentation and Multiple-Choice Reading Comprehension Dataset. Sourced from Vietnamese Wikipedia, our dataset includes 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs generated with human quality assurance, ensuring a reliable and diverse resource. Experiments show that mBERT consistently outperforms monolingual models on both tasks, achieving an accuracy of 88.01% on MRC test set and an F1 score of 63.15\% on text segmentation test set. Our analysis reveals that multilingual models excel in NLP tasks for Vietnamese, suggesting potential applications to other under-resourced languages. VSMRC is available at HuggingFace
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.05)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- (16 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.69)
MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Zhao, Jihao, Ji, Zhiyuan, Fan, Zhaoxin, Wang, Hanyu, Niu, Simin, Tang, Bo, Xiong, Feiyu, Li, Zhiyu
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively settle challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
Overview of the 2024 ALTA Shared Task: Detect Automatic AI-Generated Sentences for Human-AI Hybrid Articles
Mollá, Diego, Xu, Qiongkai, Zeng, Zijie, Li, Zhuang
While this By examining the accuracy of identifying AIgenerated collaboration offers exciting opportunities, it also sentences within texts that combine human introduces challenges -- particularly in distinguishing and AI-authored content, we aim to develop between human-authored and AI-generated more sophisticated and effective detection methods content within a single document.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia > Victoria > Melbourne (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)
Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
Zhao, Jihao, Ji, Zhiyuan, Feng, Yuchen, Qi, Pengnian, Niu, Simin, Tang, Bo, Xiong, Feiyu, Li, Zhiyu
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, which refers to a granularity between sentences and paragraphs, consisting of a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed Perplexity (PPL) Chunking, which balances performance and speed, and precisely identifies the boundaries of text chunks by analyzing the characteristics of context perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines PPL Chunking with dynamic merging to achieve a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of singlehop and multi-hop question answering based on RAG. For instance, on the 2Wiki-MultihopQA dataset, it outperforms similarity chunking by 1.32 while only consuming 45.8% of the time. Furthermore, through the analysis of models of various scales and types, we observed that PPL Chunking exhibits notable flexibility and adaptability. This is particularly relevant in knowledge-intensive tasks like open-domain question answering (Lazaridou et al., 2022). By integrating two key components: the retriever and the generator, this technology enables more precise responses to input queries (Singh et al., 2021; Lin et al., 2023). While the feasibility of the retrieval-augmentation strategy has been widely demonstrated through practice, its effectiveness heavily relies on the relevance and accuracy of the retrieved documents (Li et al., 2022; Tan et al., 2022). The introduction of excessive redundant or incomplete information through retrieval not only fails to enhance the performance of the generation model but may also lead to a decline in answer quality (Shi et al., 2023; Yan et al., 2024).
Recent Trends in Linear Text Segmentation: a Survey
Ghinassi, Iacopo, Wang, Lin, Newell, Chris, Purver, Matthew
Linear Text Segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topics change. A well-established area of research in Natural Language Processing, drawing from well-understood concepts in linguistic and computational linguistic research, the field has recently seen a lot of interest as a result of the surge of text, video, and audio available on the web, which in turn require ways of summarising and categorizing the mole of content for which linear text segmentation is a fundamental step. In this survey, we provide an extensive overview of current advances in linear text segmentation, describing the state of the art in terms of resources and approaches for the task. Finally, we highlight the limitations of available resources and of the task itself, while indicating ways forward based on the most recent literature and under-explored research directions.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > Singapore (0.04)
- Europe > Bulgaria (0.04)
- (10 more...)
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
PODTILE: Facilitating Podcast Episode Browsing with Auto-generated Chapters
Ghazimatin, Azin, Garmash, Ekaterina, Penha, Gustavo, Sheets, Kristen, Achenbach, Martin, Semerci, Oguz, Galvez, Remi, Tannenberg, Marcus, Mantravadi, Sahitya, Narayanan, Divya, Kalaydzhyan, Ofeliya, Cole, Douglas, Carterette, Ben, Clifton, Ann, Bennett, Paul N., Hauff, Claudia, Lalmas, Mounia
Listeners of long-form talk-audio content, such as podcast episodes, often find it challenging to understand the overall structure and locate relevant sections. A practical solution is to divide episodes into chapters--semantically coherent segments labeled with titles and timestamps. Since most episodes on our platform at Spotify currently lack creator-provided chapters, automating the creation of chapters is essential. Scaling the chapterization of podcast episodes presents unique challenges. First, episodes tend to be less structured than written texts, featuring spontaneous discussions with nuanced transitions. Second, the transcripts are usually lengthy, averaging about 16,000 tokens, which necessitates efficient processing that can preserve context. To address these challenges, we introduce PODTILE, a fine-tuned encoder-decoder transformer to segment conversational data. The model simultaneously generates chapter transitions and titles for the input transcript. To preserve context, each input text is augmented with global context, including the episode's title, description, and previous chapter titles. In our intrinsic evaluation, PODTILE achieved an 11% improvement in ROUGE score over the strongest baseline. Additionally, we provide insights into the practical benefits of auto-generated chapters for listeners navigating episode content. Our findings indicate that auto-generated chapters serve as a useful tool for engaging with less popular podcasts. Finally, we present empirical evidence that using chapter titles can enhance effectiveness of sparse retrieval in search tasks.
- North America > United States > Idaho > Ada County > Boise (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (10 more...)
- Media (0.36)
- Leisure & Entertainment (0.36)